Executive Summary


Since 2010, the number of teachers in Illinois public schools has decreased by roughly 15%, and the
decline has shown no signs of stopping[1]. As the number of teachers falls, the effects are reflected
in the enrollment and performance of students. Since 2010, the number of students enrolled in CPS
has fallen by 14.6%, and since the SAT grading scale changed in 2017, the average student score has
fallen by 5.5%[2][3]. It has become strikingly clear that the teacher shortage has impacted, and will
continue to impact, future generations of our society. Our model aims to identify the main causes
of teacher shortages, predict future trends, and propose viable recommendations to address the
problem.


To determine the severity of the teacher shortage problem, we first needed to establish a forecasting
model. We used a Fourier transform to find the most appropriate function to represent our data,
then added a linear component to model the consistent growth in teacher shortages. Then,
utilizing gradient boosting, we fitted the curve and obtained viable coefficients, which revealed an
overall periodic growth in teacher vacancies. This indicates that teacher shortages are a problem
that is worsening over time. We decided on seven main factors that lead to teacher shortages:
student enrollment, student daily attendance, student misconduct, student ELA performance,
student math performance, teacher salary, and the 5Essential Leadership Index[4][5][6]. Data on
these factors for each active school in CPS was then put into a Random Forest algorithm to
determine which factors contributed the most to teacher shortages. We concluded that the main
factors contributing to teacher shortages were excessive student enrollment, student misconduct,
and poor leadership.


With these predictions, we were able to quantify the risk posed by teacher shortages. We first
modeled our factors in partial dependence plots to determine critical thresholds that would put
schools in danger of facing a teacher shortage. We then created our own Risk Index calculation.
First, it takes the value of each factor and maps it to a probability based on our partial
dependence plot. It then multiplies that probability by an independently determined impact index,
and the products are summed to give the Risk Index. By comparing our index with real-world data,
we determined that our calculation was accurate and could be applied throughout CPS.


Based on our analysis as well as our Random Forest algorithm, we outlined policy recommendations
for the Chicago Public Schools system that will mitigate teacher shortage problems. We centered
these recommendations on the three main factors: excessive enrollment, student misconduct, and
poor leadership. For enrollment, we found that teachers generally prefer low enrollment as it
supports student-teacher relationships, so we recommend that schools actively promote a positive
learning environment with frequent communication. For misconduct, we found that the easiest way
to minimize misconduct was to remove detention and instead implement rehabilitation for students.
For leadership, we decided to increase the requirements to become a school administrator in order
to ensure quality administration of schools. Overall, as a financial plan to make these changes
possible, we suggested that schools lower the budget allocated to these changes during years when
teacher shortages dip below the linear regression, such as from now to 2024, and increase the
budget when teacher shortage rates rise above the linear regression, such as from 2024 to 2032.


1 Background


The night of November 20th, 2022 was a joyful one for the students at Dawes Elementary School.
Students in Skokie and Evanston, Illinois, were overjoyed to learn that their Thanksgiving break
was extended by two days. While this news brought delight to the children, it created a nightmare
for the administrators in their schools[7]. The real reason behind the prolonged break was none
other than the ongoing teacher shortage crisis, and Dawes Elementary School is not the only school
facing this pressing issue. Throughout the Chicago Public Schools district (CPS), administrators
are forced to cut courses, increase class sizes, and, in turn, decrease the quality of education
students receive due to a lack of teachers[8].


Many people point to COVID-19 as the leading cause of the teacher shortage and argue
that the crisis will resolve itself as the pandemic dies out. However, evidence of the current shortage
in educational staff has been present for over a decade, and the pandemic has only recently brought
attention to the matter. From 2010 to 2019, CPS saw its teacher population decline 5.5%[1], a
substantial number considering the sheer size of the district. Thus, it becomes clear that this issue
is here to stay unless we confront it. As the pandemic recedes and school systems return to
normalcy, the teacher crisis persists.


There are several reasons why teachers are leaving the industry en masse. Some cite low pay:
certain schools pay so little that the average McDonald's worker earns more than an educator.
Others cite the students as the problem: student performance is directly correlated with a
teacher's willingness to teach. In some cases, the shortage is even attributed to the school
administration: poor administration leads to an unwillingness to teach. Regardless of the
underlying reason, this issue needs to be addressed[9][10][11].


There exist many ways to mitigate this issue. Schools have made countless efforts to incentivize
teachers to stay by improving the workplace environment. The CPS Board raised teacher salaries
substantially, by 16% over a five-year term[12]. They also provided financial incentives to
the highest performing teachers[13]. In fact, some schools in the district are even lowering the bar
and dropping qualification requirements to become a teacher[14]. However, these are often only
temporary solutions that do not address the main causes of this shortage.


Our paper aims to determine the relative importance of the following factors contributing to
the teacher shortage crisis: student enrollment, student daily attendance, student misconduct,
student ELA performance, student math performance, teacher salary, and the 5Essential Leadership
Index. Furthermore, we will predict the number of teachers that will be in the CPS district in
future years if current trends continue. This will help the CPS board better target mitigation
efforts in the right places in order to truly solve this problem.


So, while students at Dawes Elementary School may appreciate their extra days off, two days will
eventually turn into two weeks, and two weeks into two months, if the teacher shortage continues to
worsen. Without proper education, the futures of many bright children will be jeopardized, and
they will be unable to live successful and fulfilling lives.


2 Assumptions


1. Teacher vacancies in schools are determined by comparing student-teacher ratios to the ideal
student-teacher ratio, 18:1.

• Justification: Schools may have differing policies that result in different optimal
ratios, but looking into each school's curriculum is beyond the scope of our analysis and
would prevent us from finding significant results. It is reasonable to assume all schools
want to achieve the ideal ratio of 18:1[15].


2. We will only analyze data prior to COVID-19 (specifically, 2014-2019). 


• Justification: As previously established in our background, teacher shortage issues have
existed for years before COVID-19 began. The impacts that COVID-19 has specifically
had on teacher attrition rates have begun to recede as schools return to normal[16].
Thus, the major problems that linger in the education system existed prior to the
pandemic and were simply brought into the spotlight by it. Due to this, we will neglect
data from the COVID-19 period in order to better address and analyze long-term
problems present in the education system.

3. We will only account for schools that have been open every year from 2014-2019.


• Justification: Many schools exit and join CPS each year, so to maintain consistent data
and obtain consistent trends, we only account for schools that were open during
our whole analysis time frame. Our data sets begin in 2014, as that is when the US
job market fully recovered from the Great Recession of 2007-2009[17]. This recession
severely affected the economy and job market, and we believe it would skew our results
if we included this anomalous event in our past data. Furthermore, we ended the data
set at 2019, as that was the beginning of COVID-19, another anomalous event
that would skew our data[18].


4. We did not account for the effects of inflation. 


• Justification: The average inflation rate in the United States from 2014-2019 was
around 1.8%[19], which is within the typical range of inflation fluctuations that leads
to the natural rate of unemployment (around 2% inflation[20]). Therefore, the impact of
inflation on the data is very small and can be safely ignored, as it will not greatly impact
teacher employment.

5. We did not account for students with special needs.


• Justification: Students with special needs also need special services in order to learn;
this includes specialized teachers and a smaller class size[21]. As such, we will not
account for them, because the average student does not require such resources.


3 Data Methodology


We draw data from three primary sources on the seven factors affecting teacher employment. The
sources include the Illinois State Board of Education (ISBE)[4], Chicago Public Schools (CPS)[5],
and the University of Chicago (UChicago)[6]. These are all reliable sources, as they deal directly
with the local area around CPS and are the official bodies governing education within CPS. The
seven factors we looked at within these sources are student enrollment, student daily attendance,
student misconduct, student ELA performance, student math performance, teacher salary, and the
5Essential Leadership Index. All of this data is relevant to determining the extent to which each
of the seven factors contributes to teacher shortages, as well as to predicting future trends of
teacher shortage in CPS as a whole.


It is important to note that no data was found on the number of vacancies or the types of
teacher vacancies per school (i.e. science, math, reading, etc.). Thus, instead of looking at this on
an individual level, we chose to address teacher shortages by school and discuss which
characteristics of a school prevent a teacher shortage at that school.


We prepared our data for modeling in several steps. We first removed the data for schools that
were not open for the entirety of 2014-2019, per Assumption 3. The main obstacle was the way
different data sets identified schools: some used the school ID, others the abbreviated school name,
and others the full name of the school. To make the data sets agree on school references, we found
a data set provided by CPS that links each school's short name, long name, and ID. We imported
this into Excel and ran a matching program, which converted the short and long names in each data
set into school IDs. Following this, we used a secondary matching algorithm to combine the data
sets into a final data set containing each school ID and its corresponding factor values. Finally,
we dropped schools with missing data for any of the factors, because incomplete rows could
introduce selection bias into our random forest. We opted for this instead of substituting empty
values with the average of the non-empty values, also to avoid bias. In the end, this turned out to
be 476 rows of the 3828 rows we chose to analyze. This allows our data set to be used easily and
efficiently in our model.
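The matching and filtering steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the Excel workflow we actually ran: the crosswalk rows, school names, IDs, and factor values are hypothetical placeholders, and the function names are our own.

```python
# Hypothetical crosswalk rows: (school_id, short_name, long_name).
CROSSWALK = [
    (1001, "SCHOOL A", "School A Elementary School"),
    (1002, "SCHOOL B HS", "School B High School"),
]


def build_lookup(crosswalk):
    """Map every known short or long name (case-insensitive) to a school ID."""
    lookup = {}
    for school_id, short_name, long_name in crosswalk:
        lookup[short_name.strip().lower()] = school_id
        lookup[long_name.strip().lower()] = school_id
    return lookup


def to_school_id(key, lookup):
    """Return the ID for a name or ID key, or None if it cannot be matched."""
    if isinstance(key, int):
        return key
    return lookup.get(key.strip().lower())


def merge_factors(datasets, lookup):
    """Combine {factor: {school key: value}} into {school_id: {factor: value}}.

    Schools missing any factor are dropped rather than imputed.
    """
    merged = {}
    for factor, table in datasets.items():
        for key, value in table.items():
            school_id = to_school_id(key, lookup)
            if school_id is not None and value is not None:
                merged.setdefault(school_id, {})[factor] = value
    factors = set(datasets)
    return {sid: row for sid, row in merged.items() if set(row) == factors}


lookup = build_lookup(CROSSWALK)
datasets = {  # hypothetical values keyed by ID, short name, or long name
    "enrollment": {1001: 410, "SCHOOL B HS": 275},
    "attendance": {"School A Elementary School": 94.1, 1002: None},
}
complete = merge_factors(datasets, lookup)
print(complete)
```

In this toy input the second school has no attendance value, so only the first school survives the final filter.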


Class Size from ISBE Illinois Report Card[4]

• Motivation: There is no existing teacher vacancy data. The closest comparison to reveal
whether or not a shortage exists is average class size.

• Parameters: This data set provides us with the average number of students per class in a given
school in the CPS district.

• Purpose: We use the average number of students per class to determine the student-teacher
ratio. Public schools want the ideal student-teacher ratio of 18:1[22], which is a good balance
between cost cutting and student performance. Thus, if the student-teacher ratio is greater
than 18:1, teacher vacancies exist in the school.
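The vacancy rule from Assumption 1 can be written as a short sketch. The function names are our own, and `teachers_needed` is a hypothetical extension that is not used elsewhere in the paper; it only illustrates what the 18:1 threshold implies for staffing.

```python
import math

IDEAL_RATIO = 18.0  # ideal student-teacher ratio from Assumption 1


def has_vacancy(student_teacher_ratio, ideal=IDEAL_RATIO):
    """Flag a school as short of teachers when its ratio exceeds the ideal."""
    return student_teacher_ratio > ideal


def teachers_needed(enrollment, teachers, ideal=IDEAL_RATIO):
    """Hypothetical extension: extra teachers required to reach the ideal ratio."""
    required = math.ceil(enrollment / ideal)
    return max(0, required - teachers)


print(has_vacancy(23.0))         # ratio above 18:1
print(has_vacancy(18.0))         # ratio exactly at the ideal
print(teachers_needed(460, 20))  # hypothetical school with a 23:1 ratio
```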


Student Enrollment from ISBE Illinois Report Card[4]

• Motivation: Student enrollment affects the need for teachers. This, in turn, determines
whether or not there is a teacher shortage[23].

• Parameters: This data set provides us with the number of students enrolled in a given
school in the CPS district over the years 2014-2019.

• Purpose: We use student enrollment data to see how an increase or decrease in students
impacts the number of teachers required to operate a school. A decrease in teacher shortages
can be attributed to a decrease in student enrollment. The opposite holds true as well.


Student Daily Attendance from ISBE Illinois Report Card[4]

• Motivation: Student attendance reflects student engagement and commitment, and thus
teacher morale. This directly affects a teacher's willingness to continue working[24].

• Parameters: This data set provides us with the average percentage of students present in a
given school throughout a given year.

• Purpose: We use student attendance data to see how student commitment and participation
impact teacher shortages. A decrease in teacher shortages can be attributed to increased
student commitment, which is reflected in increased student attendance. The opposite holds
true as well.


Student Misconduct from CPS Metrics[25] 


• Motivation: Student misconduct affects a teacher's safety and thus willingness to continue
working.

• Parameters: This data set provides us with the number of times in a school year that a
school reports behaviors that violate the Student Code of Conduct. We divided the values given
by the data by the school enrollment in order to account for the fact that bigger schools
might report more incidents of misconduct.

• Purpose: We use student misconduct data to see how student behavior impacts teacher
safety. A decrease in teacher shortages can be attributed to increased teacher safety, which
in turn stems from a decrease in student misconduct. The opposite holds true as well. We
decided not to use suspension and expulsion data, as misconduct captures a wider range of
disruptive behaviors. Expulsions and suspensions are also less likely to be given out by
schools, leading to values of 0 for the majority of schools[26].


Student ELA Performance from ISBE Illinois Report Card[4]

• Motivation: Student ELA performance determines a teacher's teaching satisfaction and
thus willingness to continue working.

• Parameters: This data set provides us with the percentage of students in a given school and
a given year who meet the national average in ELA.

• Purpose: We use student ELA performance data to see how student proficiency impacts
teacher satisfaction. If students perform extremely poorly, a teacher may feel less motivated
to teach them, and this may lead to increased teacher shortage[27].


Student Math Performance from ISBE Illinois Report Card[4]


• Motivation: Student math performance determines a teacher's teaching satisfaction and
thus willingness to continue working.

• Parameters: This data set provides us with the percentage of students in a given school and
a given year who meet the national average in math.

• Purpose: We use student math performance data to see how student proficiency impacts
teacher satisfaction. If students perform extremely poorly, a teacher may feel less motivated
to teach them, and this may lead to increased teacher shortage[27].


Teacher Salary from CPS Employee Position Files[28]


• Motivation: Salary directly affects how financially satisfactory the job is and thus a
teacher's willingness to stay.

• Parameters: This data set provides us with the average yearly salary of a teacher in a
given school.

• Purpose: We use teacher salaries to see how increased salaries increase teacher financial
security. A decrease in teacher shortages can be attributed to increased financial security,
which in turn stems from increased salary. The opposite holds true as well[29].


5Essential Leadership Index from ISBE 5Essentials[30]


• Motivation: The 5Essential Leadership Index is determined by the University of Chicago
through the statewide 5Essentials survey given to students, teachers, and parents.

• Parameters: This data set provides us with an index that evaluates the performance of a
given school's administration (principals, vice principals, etc.).

• Purpose: We use this index to see how proficient leadership in the school administration
increases teacher stability. A decrease in teacher shortages can be attributed to a higher index
score at a given school. The opposite holds true as well[31].


4 Mathematical Methodology


In our model, we seek to predict the future trends of teacher shortages and determine the relative 
importance of the chosen factors on causing teacher shortages from 2014-2019. We first use a Fourier 
Transform to determine a regression which is further optimized through gradient boosting. We 
had considered other regressions such as exponential smoothing and an auto-regressive integrated 
moving average (ARIMA) model, but were limited by the number of data points and the noisiness 
of the data. Then, we use the random forest feature importance algorithm to determine the most 
important factor contributing to teacher shortages in each year from 2014-2019. Next, we combine 
all the data from each year and run the random forest feature importance algorithm on the large 
data set. We then use a partial dependence plot to determine how large the value of a factor has 
to be to have no effect on teacher shortages. The partial dependence plot also determines the 
interval of values for which the factor impacts teacher shortages. We created partial dependence 
plots for the most relatively important factor in each year from 2014-2019. We also created partial 
dependence plots for every factor from the combined data set of all six years. Note that in our
data, a year such as "2014" refers to the school year "2013-2014".


4.1 Fourier Transform and Gradient Boosting 


Our past data for student-teacher ratios varies greatly, constantly alternating between values
around 23 and values around 19. Therefore, given this fluctuating data set, we decided upon a
Fourier transform to formulate a pattern that generalizes and models these trends. Utilizing a
Fourier transform, we found that our data could be closely broken down into 4 fundamental
frequencies and a linear component, which adjusts for the overall slow increase in the ratio over
time. Writing out these 4 sine and cosine functions alongside the linear increase, the general
equation for our model was found to be

$$a + bx + c\sin(dx + e) + f\cos(dx + e) + g\sin(hx) + h\cos(gx) \tag{1}$$


Following this, to fit this function closely to our data, we utilized a gradient boosted fitting
method. For our loss function, we found that the squared loss outperformed the logarithmic
hyperbolic cosine (log-cosh) loss and the absolute loss in fitting our model. We suspect this is due
to the rapid growth of the squared loss compared to others, such as the absolute loss. Large
outliers and errors in our predictions would significantly skew our final results and predictions,
and would be easily recognizable against our smaller oscillating function. Therefore, it makes
sense that we found, from reasoning and graphical interpretation, that the squared loss provided
less error over the 20-year prediction range we examined. Using this gradient boosting, we found
the closest fit, which is graphed and tabulated as follows.

Figure 1: Fourier Transform for CPS elementary schools (plot title: Predictions for Future Student to Teacher Ratio for Elementary)

Figure 2: Fourier Transform for CPS elementary schools (plot title: Predictions for Future Student to Teacher Ratio for High School)

Variable | Value
a | 100.793183
b | 0.115572800
c | 54.1830283
d | 0.0000239402306
e | 5.05552528
f | -86.2378408
g | 1.84024119
h | 0.450852865
Table 1: Variable Values for Equation 1

Variable | Value
a | 18.31325684
b | 0.1295682
c | 0.54045429
d | 0.73775414
e | 6.52288523
f | -0.67572896
g | 1.1881641
h | 1.02244719
Table 2: Variable Values for Equation 1


These results make sense because, as mentioned in the background, the solutions that CPS
currently implements are band-aid solutions that address the problem only in the short term and
do nothing to aid the issue in the long term. As seen in the linear term that acts as the
base axis of our Fourier model, the general trend of the student-teacher ratio is still increasing,
showing that this is a problem that will continue to get worse in the future. Thus, this is a
problem we need to fix.
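For readers who want to evaluate Equation 1 directly, the following is a minimal sketch using the second coefficient set (Table 2). The units of `x` are an assumption here, since the text does not state whether `x` is a calendar year or a year index. The three loss functions are the standard definitions of the squared, absolute, and log-cosh losses compared in Section 4.1; they are not taken from our fitting code.

```python
import math

# Coefficients a..h copied from Table 2.
TABLE_2 = dict(a=18.31325684, b=0.1295682, c=0.54045429, d=0.73775414,
               e=6.52288523, f=-0.67572896, g=1.1881641, h=1.02244719)


def ratio_model(x, a, b, c, d, e, f, g, h):
    """Equation 1: a linear trend plus sine and cosine terms."""
    return (a + b * x
            + c * math.sin(d * x + e) + f * math.cos(d * x + e)
            + g * math.sin(h * x) + h * math.cos(g * x))


def linear_trend(x, a, b, **_unused):
    """The non-periodic part of Equation 1, a + b*x."""
    return a + b * x


def squared_loss(residual):
    return residual ** 2


def absolute_loss(residual):
    return abs(residual)


def log_cosh_loss(residual):
    return math.log(math.cosh(residual))


# x is assumed to be a year index; this is an illustration only.
for x in range(0, 6):
    print(x, ratio_model(x, **TABLE_2), linear_trend(x, **TABLE_2))

# A large residual is penalized most heavily by the squared loss.
print(squared_loss(3.0), absolute_loss(3.0), log_cosh_loss(3.0))
```

The last line illustrates the point made about outliers: for a residual of 3, the squared loss is larger than the absolute loss, which is in turn larger than the log-cosh loss.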


4.2 Random Forest Feature Importance 


The random forest feature importance algorithm determines the relative importance of each of the
seven factors for each year from 2014-2019. We also use this algorithm on the combined data set.
Thus, we produce seven such models depicting the importance of each factor. We use an 80-20
train-test split to implement the random forest feature importance algorithm. This means we use
80% of the data set to train the decision trees and 20% of the data set to test the model and
obtain the relative feature importance. The random forest algorithm will help us determine which
factors we need to target our mitigation efforts on, based on which factor impacts teacher
shortage the most.
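The 80-20 split itself can be illustrated without any modeling library. This is a sketch under stated assumptions: the function name, the fixed seed, and the integer rows are placeholders, and it does not reproduce the split used to train our random forest.

```python
import random


def train_test_split_rows(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and hold out a fraction for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(100))  # placeholder for one row per school-year
train_rows, held_out_rows = train_test_split_rows(rows)
print(len(train_rows), len(held_out_rows))
```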


4.2.1 Variables


Factor | Units
Class size | Student to teacher ratio
Student enrollment | Students enrolled
Student daily attendance | Average percentage
Student misconduct | Misconducts per student
Student ELA performance | Percentage proficient
Student math performance | Percentage proficient
Teacher salary | Dollars
5Essential Leadership Index | Index value

Table 3: Variables for our mathematical models


4.2.2 Results 


The results we return from the algorithm are depicted in the following graphs:

[Figure: Random Forest Feature Importances for All Years]

[Figure: Random Forest Feature Importances for Years 2014-2019]


In four of the six years for which we ran the random forest feature importance, the most
important factor was enrollment (2014, 2015, 2016, 2019). In 2017 and 2018, the most important
factors were daily student attendance and the 5Essential Leadership Index, respectively.

The random forest algorithm was responsible for classifying vacancies based on the factor data. The
random forest results were then compared with existing vacancies to determine whether the model
correctly predicted the existence of a vacancy. The accuracy score of our Random Forest model on
the combined data set is 0.8987. This means the random forest algorithm correctly predicts whether
a vacancy exists 89.87% of the time. We find a mean absolute error (MAE) of 0.1013
and a root mean square error (RMSE) of 0.3183, which suggests the model fits the data well.
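The three reported metrics are closely linked. Assuming vacancies are coded as 0/1 labels (an assumption about the encoding, which is not stated above), every error has magnitude 1, so the MAE equals one minus the accuracy and the RMSE equals the square root of the MAE. The sketch below checks this against the reported figures; the four-element label lists are made-up examples.

```python
import math


def classification_metrics(y_true, y_pred):
    """Accuracy, MAE and RMSE for predictions coded as 0/1 labels."""
    n = len(y_true)
    errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
    accuracy = sum(1 for err in errors if err == 0) / n
    mae = sum(errors) / n
    rmse = math.sqrt(sum(err ** 2 for err in errors) / n)
    return accuracy, mae, rmse


# Made-up labels: one wrong prediction out of four.
print(classification_metrics([1, 0, 1, 1], [1, 0, 0, 1]))

# Values implied by the reported accuracy under the 0/1 assumption.
reported_accuracy = 0.8987
implied_mae = 1 - reported_accuracy
implied_rmse = math.sqrt(implied_mae)
print(round(implied_mae, 4), round(implied_rmse, 4))
```

Under this encoding the MAE and RMSE carry the same information as the accuracy, so they should be read as restatements of it rather than as independent evidence.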


4.2.3 Strengths 


A strength of our model is that random forest is highly robust and resistant to over-fitting, which 
means it can handle noisy data and outliers effectively. As seen with trends in our factors, some 
years have unexpectedly high or low values that can't be adequately explained so we use multiple 


The algorithm is computationally efficient and can handle larger data sets. 


4.2.4 Weaknesses 


A weakness of our model is that Random Forest feature importance may be biased towards
categorical variables with many levels. This is because each level of a categorical variable is
treated as a separate feature, which can lead to an over-representation of categorical variables
in the feature importance ranking.


4.3 Partial Dependence Plot 


A partial dependence plot is a powerful visualization tool that we use to show the relationship
between a factor variable and our dependent variable, the student-teacher ratio. We chose to use
one-way partial dependence plots to ensure easy understanding and visual interpretation. We can
model this with the equation

$$PD(x) = E\left[\hat{f}(X', x)\right] \tag{2}$$

where $PD(x)$ is the partial dependence of the student-teacher ratio on a single factor variable
$x$, $E$ is the expected value operator, and $\hat{f}(X', x)$ is the random forest model's
predicted outcome for the set of predictor variables $X'$, where the value of $x$ has been fixed to
a specific value. Our partial dependence data points formed a discrete set of values. To turn this
into a continuous graph that can be visualized, we utilized gradient boosting to fit a curve
through the discrete data points. This was done using the sklearn Python package, which has
readily available tools for partial dependence plots and gradient boosting.
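Equation 2 can be read as "fix one factor at a value, predict for every school, and average". The sketch below spells this out in plain Python rather than through sklearn. The predictor `toy_predict`, its coefficients, and the two sample rows are hypothetical stand-ins for the fitted random forest and the CPS data.

```python
def partial_dependence(predict, rows, feature, grid):
    """Average prediction over all rows with `feature` forced to each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = dict(row)       # copy so the input rows stay unchanged
            modified[feature] = value  # fix the factor of interest
            total += predict(modified)
        curve.append(total / len(rows))
    return curve


def toy_predict(row):
    """Hypothetical stand-in for the fitted model's predicted ratio."""
    return 15.0 + 0.01 * row["enrollment"] + 2.0 * row["misconduct"]


rows = [
    {"enrollment": 300, "misconduct": 0.5},
    {"enrollment": 600, "misconduct": 1.0},
]
curve = partial_dependence(toy_predict, rows, "enrollment", [200, 400])
print(curve)
```

Because every other factor keeps its observed value in each row, the curve isolates the average effect of the fixed factor, which is what the one-way plots display.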


4.3.1 Results 


The local extrema in each factor's plot represent critical values, which are the thresholds at
which schools are put at risk. The results we return from the algorithm are depicted in the
following graphs:

[Figure: Partial Dependence Plots for All Years (vertical axis: partial dependence)]

[Figure: Partial Dependency Plots for Years 2014-2019]


4.4 Sensitivity Analysis 
4.4.1 Permutation Feature Importance 


We analyzed the sensitivity of our variables using a permutation feature importance algorithm. This
algorithm is another method of determining the relative feature importance. It works by randomly
permuting the values of each factor in the test data set and measuring how much the permutation
reduces the model's accuracy. If a feature is important to the model, then permuting its values
should lead to a significant decrease in the model's performance. On the other hand, if a feature
is not important, permuting its values should not have much effect on the model's performance.
The importance of a feature is measured as the difference between the original accuracy and
the accuracy on the permuted data set.
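A minimal sketch of this procedure follows, including the mean and one-standard-deviation spread over repeated shuffles that the error bars in the sensitivity results represent. The classifier `toy_predict`, the generated rows, the repeat count, and the seed are all hypothetical; the sketch does not reproduce our fitted model.

```python
import random
import statistics


def accuracy(predict, rows, labels):
    hits = sum(1 for row, label in zip(rows, labels) if predict(row) == label)
    return hits / len(rows)


def permutation_importance(predict, rows, labels, feature, repeats=10, seed=0):
    """Mean and standard deviation of the accuracy drop when `feature` is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    drops = []
    for _ in range(repeats):
        column = [row[feature] for row in rows]
        rng.shuffle(column)  # break the link between this feature and the labels
        permuted = [dict(row, **{feature: v}) for row, v in zip(rows, column)]
        drops.append(baseline - accuracy(predict, permuted, labels))
    return statistics.mean(drops), statistics.stdev(drops)


def toy_predict(row):
    """Hypothetical classifier: shortage iff enrollment > 500; ignores attendance."""
    return int(row["enrollment"] > 500)


rows = [{"enrollment": e, "attendance": 90 + i % 5}
        for i, e in enumerate(range(100, 1100, 50))]
labels = [toy_predict(row) for row in rows]

enroll_mean, enroll_std = permutation_importance(toy_predict, rows, labels, "enrollment")
attend_mean, attend_std = permutation_importance(toy_predict, rows, labels, "attendance")
print("enrollment:", enroll_mean, "+/-", enroll_std)
print("attendance:", attend_mean, "+/-", attend_std)
```

Shuffling a factor the classifier ignores leaves the accuracy unchanged, while shuffling the factor it depends on lowers it, which is the behavior described above.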


4.4.2 Results 


We conducted a sensitivity analysis on every factor and added an error bar that represents one
standard deviation above and below the calculated value. Since in every plot except 2018 the error
bar of the leading cause does not overlap with the error bars of the other factors, we find that our
results are statistically significant and that those specific factors are indeed the most significant
cause of teacher shortage in that year. While 2018 does not follow this pattern, its top 3 factors
do, as their error bars do not overlap with those of the other factors. Thus, we decided to analyze
only the top 3 factors, as those factors are statistically significant for all years.


5 Risk Analysis


5.1 Risk Model Development 


Teacher shortages are devastating to both the schools and the communities they impact. The
potential losses that people will experience from teacher vacancies can range greatly. As such, we
aim our risk analysis at determining which schools and which communities are most at risk from the
ongoing teacher shortage. The main risk is to the development and future of the children attending
school. Therefore, we define a risk index that evaluates the total risk posed to the future of a
student attending a school. This risk index can be calculated using the following formula

$$\text{Risk Index} = P_a \times I_a + P_e \times I_e + P_l \times I_l + P_m \times I_m + P_{ela} \times I_{ela} + P_{ma} \times I_{ma} \tag{3}$$

where the variables mentioned are the ones present in the following table.


Variable | Meaning
$P_a$ | Probability of Teacher Shortage Impacting Attendance
$P_e$ | Probability of Teacher Shortage Impacting Enrollment
$P_l$ | Probability of Teacher Shortage Impacting Leadership
$P_m$ | Probability of Teacher Shortage Impacting Misconduct
$P_{ela}$ | Probability of Teacher Shortage Impacting ELA Performance
$P_{ma}$ | Probability of Teacher Shortage Impacting Math Performance
$I_a$ | Impact of Attendance on the Future and Development of Students
$I_e$ | Impact of Enrollment on the Future and Development of Students
$I_l$ | Impact of Leadership on the Future and Development of Students
$I_m$ | Impact of Misconduct on the Future and Development of Students
$I_{ela}$ | Impact of ELA Performance on the Future and Development of Students
$I_{ma}$ | Impact of Math Performance on the Future and Development of Students

Table 4: Variable Values for Equation 3


Variable | Value | Reasoning
$I_a$ | 5 | We determined this value because although personal attendance matters greatly towards a student's individual performance, the attendance of other people should not impact another student's development greatly.[32]
$I_e$ | 10 | Enrollment matters greatly because the more students there are, the less time a teacher can spend on each individual student, greatly impacting a student's development.[33]
$I_l$ | 5 | Leadership decisions often do not have a direct impact on student performance, though some prominent ones may arise through actions such as policies changing a student's environment.[34]
$I_m$ | 10 | Student misconduct can bring about a negative school environment for other students, greatly impacting the way in which they develop.[35]
$I_{ela}$ | 15 | English performance scores directly correlate to and measure how much students have developed.
$I_{ma}$ | 15 | Math performance scores directly correlate to and measure how much students have developed.


Table 5: Variable Values for Equation 3


To determine the probability values, we can utilize the partial dependence plots found in our earlier
model. The partial dependence plot tells us how each factor is connected to teacher shortages
when the factor is at a certain value. We assume this relationship can be reversed, so that the
influence of teacher shortages on each factor is described by the same partial dependence.
Therefore, a value of a factor with a lower partial dependence on teacher shortages implies that the
probability of teacher shortage impacting that factor is lower. This justifies setting each
probability per factor per school to the partial dependence value corresponding to that factor.
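Equation 3 with the impact weights of Table 5 can be sketched in a few lines. The dictionary keys and the function name are our own labels; the partial dependence values used as sample input are the ones reported for the Chicago Math and Science Academy Charter School in Table 6.

```python
# Impact weights from Table 5.
IMPACT = {"attendance": 5, "enrollment": 10, "leadership": 5,
          "misconduct": 10, "ela": 15, "math": 15}


def risk_index(probabilities, impact=IMPACT):
    """Equation 3: sum of probability times impact over the six factors."""
    return sum(probabilities[factor] * weight for factor, weight in impact.items())


# Partial dependence values reported in Table 6.
table_6 = {"attendance": 0.82396, "enrollment": 0.78826, "leadership": 0.77340,
           "misconduct": 0.83284, "ela": 0.79469, "math": 0.85498}

total = risk_index(table_6)
print(round(total, 5))
```

The result matches the total of 48.94285 reported for that school.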


5.2 Results 


Utilizing values from our data sets as well as Equation 3, we found the highest and lowest risk
schools in CPS. The following tables present the values.


Factor | Value | Partial Dependence | Risk Index
Attendance | 93.0 | 0.82396 | 4.1198
Enrollment | 314 | 0.78826 | 7.8826
Leadership | 82.0 | 0.77340 | 3.867
Misconduct | 1.038 | 0.83284 | 8.3284
ELA Performance | 23.951 | 0.79469 | 11.92035
Math Performance | 57.259 | 0.85498 | 12.8247
Total | | | 48.94285

Table 6: Chicago Math and Science Academy Charter School (Low Risk)


Factor | Value | Partial Dependence | Risk Index
Attendance | 93.0 | 0.82251 | 4.11255
Enrollment | 275 | 0.78826 | 7.8826
Leadership | 86.0 | 0.75027 | 3.75135
Misconduct | 1.185 | 0.82325 | 8.2325
ELA Performance | 22.169 | 0.77503 | 11.62545
Math Performance | 56.533 | 0.82391 | 12.35865
Total | | | 47.9631


Table 7: Theodore Roosevelt High School (Low Risk) 


Factor | Value | Partial Dependence | Risk Index
Attendance | 95.9 | 0.92121 | 4.60605
Enrollment | 685 | 0.89183 | 8.9183
Leadership | 15.0 | 0.88756 | 4.4378
Misconduct | 0.476 | 0.89021 | 8.9021
ELA Performance | 59.388 | 0.86634 | 12.9951
Math Performance | 53.804 | 0.85783 | 12.86745
Total | | | 52.7268


Table 8: Academy for Global Citizenship Charter School (High Risk) 
Factor | Value | Partial Dependence | Risk Index
Attendance | 95.9 | 0.92121 | 4.60605
Enrollment | 685 | 0.89183 | 8.9183
Leadership | 15.0 | 0.88756 | 4.4378
Misconduct | 0.524 | 0.86937 | 8.6937
ELA Performance | 60.236 | 0.87129 | 13.06935
Math Performance | 51.124 | 0.85872 | 12.8808
Total | | | 52.606


Table 9: Catalyst Elementary Charter School - Circle Rock (High Risk) 
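

The Risk Index column in Tables 6 through 9 is consistent with multiplying each factor's Table 5 value by its partial dependence and summing the results. Assuming that reading of Equation 3, the totals can be reproduced with a short sketch. The dictionary keys and the `risk_index` helper are our own illustrative names.

```python
# Weights taken from Table 5 (the keys are illustrative labels).
WEIGHTS = {
    "Attendance": 5,
    "Enrollment": 10,
    "Leadership": 5,
    "Misconduct": 10,
    "ELA Performance": 15,
    "Math Performance": 15,
}


def risk_index(partial_dependence):
    """Sum of weight * partial dependence over all six factors.

    Assumption: this weighted sum is how Equation 3 combines the factors.
    """
    return sum(WEIGHTS[f] * partial_dependence[f] for f in WEIGHTS)


# Partial dependence values copied from Table 6 (low risk).
table_6 = {
    "Attendance": 0.82396, "Enrollment": 0.78826, "Leadership": 0.77340,
    "Misconduct": 0.83284, "ELA Performance": 0.79469,
    "Math Performance": 0.85498,
}
# Partial dependence values copied from Table 8 (high risk).
table_8 = {
    "Attendance": 0.92121, "Enrollment": 0.89183, "Leadership": 0.88756,
    "Misconduct": 0.89021, "ELA Performance": 0.86634,
    "Math Performance": 0.85783,
}

print(round(risk_index(table_6), 5))  # 48.94285, the Table 6 total
print(round(risk_index(table_8), 5))  # 52.7268, the Table 8 total
```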


5.3 Analysis 


From our risk index, we found that the Chicago Math and Science Academy Charter School and Theodore Roosevelt High School have the lowest risk indices in CPS, while the Academy for Global Citizenship Charter School and the Catalyst Elementary Charter School - Circle Rock have the highest. This aligns closely with external rankings: the Chicago Math and Science Academy Charter School is ranked as the 4th best charter school in Illinois, and Theodore Roosevelt High School is ranked 78th among all CPS schools. These rankings support our index model, as both are strong schools, likely helped by the low risk we determined for them during this teacher shortage crisis.[36][37] Furthermore, the Catalyst Elementary Charter School - Circle Rock is ranked 2038th among all elementary schools in Illinois, and the Academy for Global Citizenship Charter School is ranked in the bottom 50% of all schools in Illinois.[38][39] This also matches what our model expects, as schools with a high risk index are expected to suffer during a teacher shortage crisis.


This risk index is versatile: because it requires only data on the factors, it can be applied to any school regardless of time or location. Furthermore, by computing the index for well-established, high-performing schools, one can compare indices between schools to see whether a given school is at higher or lower risk.
6 Recommendations


We now propose actions that the Chicago Public Schools district can take to help reduce teacher shortages. These recommendations focus on the three most pressing factors behind teacher shortages according to our Random Forest algorithm: excessive enrollment, student misconduct, and poor leadership. Additionally, we outline a possible budgeting plan for CPS to follow, as these changes are bound to be costly.


6.1 Enrollment 


As our random forest shows, enrollment is consistently a significant factor in teacher shortages. It is therefore a pressing issue which, once addressed, could greatly reduce teacher vacancies in Chicago. One way to make sure teachers are content with school size is to add more schools. As mentioned, most teachers report higher levels of happiness and morale when teaching at smaller schools, usually attributing this to closer student-teacher relationships and more administrative attention. The only exception is large magnet-type schools with high student populations. There, the large population results from the number of applicants wanting to enroll, and the school chooses its students, unlike a regular large public high school, which students are required to attend and which has an obligation to serve them. Therefore, we recommend that large public schools hold frequent faculty meetings in which administrators can listen to teachers' ideas and complaints, forming a closer and more attentive relationship with teachers. Furthermore, we recommend that schools promote positive learning environments, using school-wide core values and accessible lines of communication to teachers, as this has been shown to foster better teacher-student relationships.[40]


6.2 Misconduct 


Alongside enrollment, student misconduct was also shown to be a significant factor in teacher vacancies. One way to minimize misconduct would be to implement rehabilitation in schools in place of traditional measures such as detention or suspension. Educational courses and community service following an incident can help students learn to improve themselves, rather than simply punishing them and allowing the misconduct to occur again. According to the cited study, rehabilitation can reduce further offences by around 62%. Thus, while traditional punishments may improve student behavior in the short term, truly solving the problem in the long term requires replacing such punishments with rehabilitation efforts.[41]


6.3 Leadership 


As shown in our random forest model, leadership is also a significant factor in teacher shortages. To address this problem, we suggest that CPS create mandatory leadership sessions and classes in which school administrators such as principals can continuously improve their leadership. This helps because many administrators completed their own schooling long ago, and regularly taking classes can ensure that they stay qualified to administer a school. Another solution is to raise the qualifications required to become an administrator, contrary to what schools are currently doing. While this may seem counterintuitive and runs against the current practice of lowering requirements to attract more personnel, in the long run it will be more beneficial, as it will raise teacher retention rates and lower teacher attrition rates, better solving the problem in the long term.[42]


6.4 Overall Funding 


In reality, funding for public schools is heavily limited and cannot be spent solely on solving teacher shortages. For all schools, resources, maintenance, and other bills each require a portion of the overall funds. Therefore, we suggest the following strategy to maintain a healthy spending plan while still addressing the teacher shortage crisis. As our model for student-teacher ratios shows, the ratio fluctuates around our base linear regression. When the model falls below this trend line, we suggest lowering expenditure on the solutions mentioned above, because during these times the teacher shortage is relatively moderate and needs less intervention. When student-teacher ratios rise above our linear regression, we recommend significantly increasing the budget spent on these solutions, as these are times when teacher shortages are severe enough to significantly impact the education and growth of students. Over time, this strategy should gradually flatten the peaks of our function, resulting in an altered model that oscillates around a slowly decreasing trend line.
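

The rule above can be sketched as follows. This is an illustration only: it uses a simplified curve with a single sinusoid rather than our full fitted function, and the coefficients are hypothetical placeholders, not our fitted values. The names `trend`, `fitted`, and `funding_plan` are likewise illustrative.

```python
import math


def trend(x, a, b):
    """Base linear regression line."""
    return a + b * x


def fitted(x, a, b, c, d, e):
    """Trend plus one sinusoid: a simplified stand-in for the fitted curve."""
    return trend(x, a, b) + c * math.sin(d * x + e)


def funding_plan(years, params):
    """Mark a year 'increase' when the curve is above the trend line,
    and 'decrease' otherwise."""
    a, b = params[0], params[1]
    plan = {}
    for year in years:
        above = fitted(year, *params) > trend(year, a, b)
        plan[year] = "increase" if above else "decrease"
    return plan


# Hypothetical coefficients (a, b, c, d, e) giving a 16-year cycle.
params = (20.0, 0.05, 1.5, 2 * math.pi / 16, 0.5)
plan = funding_plan(range(16), params)
for year, action in plan.items():
    print(year, action)
```

Here the year index plays the role of "years since the start of the data"; with a real fit, the same comparison would be made between the fitted curve and the regression line at each forecast year.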


For example, in our 20-year estimate, we recommend decreasing funding from 2019 to 2024, as our model predicts student-teacher ratios to be relatively low during this period, requiring less attention and fewer direct countermeasures. Then, from 2024 to 2032, we recommend increasing the budget spent on combating teacher shortages. As our curve shows, the amount spent should rise alongside the student-teacher ratio. We therefore recommend slowly scaling up the budget from 2024 to 2028, which helps avoid exceeding the overall budget because the school can plan ahead and gradually cut other costs. Then, as our model predicts teacher shortages declining, we recommend slowly cutting back on this budget and creating plans to reallocate the funds as the school gradually returns to regular spending.
7 Conclusion


Overall, our model determined the main factors contributing to the current teacher shortage crisis, predicted the general trend of student-teacher ratios into the future, and analyzed the risk that teacher shortages pose to future generations of Chicagoans. We analyzed the risks of teacher shortages in specific schools within CPS and provided recommendations for CPS to implement in order to mitigate teacher shortages. We determined that the main factors to mitigate are excessive enrollment, student misconduct, and poor leadership. Accordingly, we outlined the following recommendations: adding more schools, transitioning from traditional punishment to rehabilitation, and raising the requirements to become a school administrator. We then outlined a financial plan that targets funding at the years our Fourier transform identified as having high teacher vacancies, which would help with budgeting and further mitigate the problem.
A Appendix


Below is our full random forest, with its main branches broken down.


Figure 3: Random Forest Model 
[Tree node shown in figure: Five_Essential_Leadership_Index < 67.5, gini = 0.497, samples = 10, value = [6, 7], class = 1.0]


Figure 4: Leftmost Branch of Random Forest


Figure 5: Middle Branch of Random Forest


[Tree node shown in figure: Performance_Math < 57.132, samples = 3, value = [3, 3], class = 0.0]


Figure 6: Rightmost Branch of Random Forest
B Code Used


# Initialize Google Colab Drive
from google.colab import drive
drive.mount('/content/drive')

# The path where all data is found
path = 'drive/MyDrive/MTFCFiles/'

# Import Libraries
import graphviz
import math
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.optimize import curve_fit, fsolve
from scipy.signal import find_peaks
# The metrics module is needed for the error values printed in dispYearRandom
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.inspection import partial_dependence, permutation_importance, PartialDependenceDisplay
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import export_graphviz

plt.style.use('ggplot')

# Our factors (the input features of the model)
target_list = ['Enrollment', 'Attendance_Avg_Daily', 'Five_Essential_Leadership_Index', 'Misconduct', 'Performance_ELA', 'Performance_Math']


# Clean the data
def clean(year):
    # File path by year
    file = path + year + ".csv"
    df = pd.read_csv(file)
    keep_rows = []

    # Fill in holes with average of the column and convert all the strings to floats
    for col in target_list:
        count = 0
        tot = 0
        for idx, value in df[col].items():
            if isinstance(value, str):
                if value == '--' or math.isnan(int(value)):
                    df[col][idx] = float("nan")
                    continue
                value = int(value)
                df[col][idx] = value
            if math.isnan(value):
                continue
            tot += int(value)
            df[col][idx] = int(value)
            count += 1
            keep_rows.append(idx)
        for idx, value in df[col].items():
            if math.isnan(value):
                df[col][idx] = tot / count

    # However, we use the drop row strategy, because it is more dependable
    # Drop rows strategy instead of avg strategy
    df = df.loc[df.index.isin(keep_rows)]

    # Just in case Avg_Class_Size has a hole (a single dropna call is sufficient)
    df = df.dropna(subset=["Avg_Class_Size"])

    # Divide Misconduct by enrollment because bigger schools have a higher misconduct
    df['Misconduct'] = df['Misconduct'] / df["Enrollment"]

    # Create a new vacant col based on the Avg_Class_Size
    Vacant_Col = []
    for index, row in df.iterrows():
        # Only if class size is > 18, there is a teacher vacancy
        if row["Avg_Class_Size"] > 18:
            Vacant_Col.append(1)
        else:
            Vacant_Col.append(0)
    df["Is_Vacant"] = Vacant_Col
    return df


# Display the Random Forest Feature Importance and Permutation Feature Importance
# for each year. Then display the Partial Dependency
def dispYearRandom(year):
    # Get a clean dataset first
    df = clean(year)

    # Independent factors (X) and dependent variable (y)
    X = df[target_list]
    y = df["Is_Vacant"]
    X = X.astype('float')
    y = y.astype('float')

    # Split the data into train and test
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)


    # Train the random forest model
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_train, y_train)

    # Visualize the forest
    dot_data = export_graphviz(clf.estimators_[0], out_file=None,
                               feature_names=target_list,
                               class_names=[str(i) for i in clf.classes_],
                               filled=True, rounded=True,
                               special_characters=True)
    graph = graphviz.Source(dot_data)
    graph.render('Decision_Tree')

    # Predict the class for each X value
    y_pred = clf.predict(X_test)

    # Print the accuracy of the model
    accuracy = accuracy_score(y_test, y_pred)
    print("Accuracy:", accuracy)

    # Determine the feature importances and display them
    importance = clf.feature_importances_
    feat_importances = pd.Series(clf.feature_importances_, index=X.columns)
    feat_importances.nlargest(20).plot(kind='barh')
    plt.xlabel("Random Forest Feature Importance")
    feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(target_list, importance)]

    # Print the error values
    print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
    print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
    print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
    print("Data Size:", df.shape)
    plt.title(year)
    for index, value in enumerate(feat_importances.nlargest(20)):
        plt.text(value, index, str(round(value, 2)))
    plt.show()


    # Calculate permutation feature importance
    result = permutation_importance(clf, X, y, n_repeats=10, random_state=42)

    # Plot the feature importance scores and the error bars
    sorted_idx = result.importances_mean.argsort()
    plt.barh(range(X.shape[1]), result.importances_mean[sorted_idx], xerr=result.importances_std[sorted_idx])
    plt.yticks(range(X.shape[1]), [target_list[i] for i in sorted_idx])
    plt.xlabel('Permutation Importance')
    plt.title(year)

    # Make sure the text fits
    plt.xlim(right=max(result.importances_mean[sorted_idx]) + 0.01)

    # Annotate the bar chart with correct values
    for i, v in enumerate(result.importances_mean[sorted_idx]):
        plt.text(v + 0.01, i, str(round(v, 3)))
    plt.show()


    # Plot the partial dependence graphs
    # We want the partial dependency for all factors in our cumulative graph, but
    # only the most important one for the other years
    if year == "All Years":
        # Partial Dependence Display
        for i in range(sorted_idx.shape[0] - 1, -1, -1):
            fig, ax = plt.subplots()
            disp = PartialDependenceDisplay.from_estimator(clf, X, features=[sorted_idx[i]], feature_names=target_list, ax=ax)
            ax.set_xlabel(target_list[sorted_idx[i]])
            plt.title(year)
            plt.show()
    else:
        # Same here, Partial Dependence Display
        fig, ax = plt.subplots()
        disp = PartialDependenceDisplay.from_estimator(clf, X, features=[sorted_idx[-1]], feature_names=target_list, ax=ax)
        ax.set_xlabel(target_list[sorted_idx[-1]])
        plt.title(year)
        plt.show()


    # Print the scores and errors of each partial dependency
    for i in result.importances_mean.argsort()[::-1]:
        print(f"{target_list[i]:<8} {result.importances_mean[i]:.3f} +/- {result.importances_std[i]:.3f} (test score: {clf.score(X, y):.3f})")


# Too many warnings, let's get rid of them
import warnings
warnings.filterwarnings('ignore')

# Display for each year
dispYearRandom("All Years")
dispYearRandom("2014")
dispYearRandom("2015")
dispYearRandom("2016")
dispYearRandom("2017")
dispYearRandom("2018")
dispYearRandom("2019")


# Our equation
# Note: g and h each appear both as an amplitude and as a frequency in the last two terms
def Fourier(x, a, b, c, d, e, f, g, h):
    return a + b*x + c*np.sin(d*x + e) + f*np.cos(d*x + e) + g*np.sin(h*x) + h*np.cos(g*x)


# Linear Regression and Fourier Transform
def LinearAndFourier(school, remove, data):
    # Read Data
    data = pd.read_csv(path + data)
    # Remove the last n years
    if remove > 0:
        data = data[:-remove]
    X = data['Years Since'].values.astype(float)
    y = data[school].values.astype(float)

    # Fit the linear regression
    lr_model = LinearRegression().fit(X.reshape(-1, 1), y.reshape(-1, 1))
    lr_score = lr_model.score(X.reshape(-1, 1), y.reshape(-1, 1))
    print('Linear regression R-squared:', lr_score)

    # Fit our Fourier Transform model
    popt, pcov = curve_fit(Fourier, X, y)

    X_future = np.concatenate((X, np.arange(X.max() + 1, X.max() + 21)))

    # Generate predictions using the fitted function for both past and future years
    y_pred1 = Fourier(X, *popt)
    r2 = r2_score(y, y_pred1)
    print("R squared value:", r2)
    y_pred2 = Fourier(X_future, *popt)
    # Plot the data, the fitted curve, and the predicted values
    plt.scatter(X, y)
    plt.plot(X, y_pred2[:len(X)], color='red', label='Fitted curve')
    plt.plot(X_future[len(X):], y_pred2[len(X):], color='green', linestyle='--', label='Predicted values')
    plt.plot(X_future.reshape(-1, 1), lr_model.predict(X_future.reshape(-1, 1)), color='blue', label='Linear regression')
    plt.xlabel('Years Since 1997')
    plt.ylabel(school + ' Student to Teacher Ratio')
    plt.legend()
    plt.title("Predictions for Future Student to Teacher Ratio for " + school)
    plt.show()
    m = lr_model.coef_[0][0]
    b = lr_model.intercept_[0]

    # Print the equation of the line
    print("Equation of the line: y = {:.2f}x + {:.2f}".format(m, b))


# Remove 3 years due to COVID
# Generate Elementary and High School Graphs
LinearAndFourier('Elementary', 3, '97-22.csv')
LinearAndFourier('High School', 3, '97-22.csv')


# Determine the High Risk and Low Risk school
df = clean('2019')

# The ranges of peaks and valleys in our partial dependence plots
minHigh = [[650], [93], [0], [0], [50], [45, 60], [95000]]
maxHigh = [[1200], [97], [20], [1], [70], [55, 70], [98000]]
minLow = [[200], [89, 98], [80], [1], [0], [0, 95, 55], [104000]]
maxLow = [[320], [93, 100], [100], [1.30], [25], [20, 100, 58], [107000]]


for index, row in df.iterrows():
    inFullRange = True
    for i in range(6):
        inRange = False
        for x in range(len(minHigh[i])):
            # Check if it is in range for high risk
            inRange |= row[target_list[i]] >= minHigh[i][x] and row[target_list[i]] <= maxHigh[i][x]
        inFullRange &= inRange

    # If the school satisfies all ranges, print it out as high risk
    if inFullRange:
        print("High Risk: ", row)

    # Same but for low risk
    inFullRange = True
    for i in range(6):
        inRange = False
        for x in range(len(minLow[i])):
            inRange |= row[target_list[i]] >= minLow[i][x] and row[target_list[i]] <= maxLow[i][x]
        inFullRange &= inRange
    if inFullRange:
        print("Low Risk: ", row)
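
For readers without access to the CPS data files, the range check in the listing above can be exercised on a single school. The sketch below is a simplified restatement, not the code we ran: the `matches` and `classify` helpers are illustrative names, the thresholds are rewritten as (min, max) pairs with the unused salary entry omitted, and the two example schools use the factor values reported in Tables 6 and 8.

```python
FACTORS = [
    "Enrollment", "Attendance_Avg_Daily", "Five_Essential_Leadership_Index",
    "Misconduct", "Performance_ELA", "Performance_Math",
]

# One list of (min, max) ranges per factor, in the order of FACTORS.
HIGH = [[(650, 1200)], [(93, 97)], [(0, 20)], [(0, 1)], [(50, 70)],
        [(45, 55), (60, 70)]]
LOW = [[(200, 320)], [(89, 93), (98, 100)], [(80, 100)], [(1, 1.30)],
       [(0, 25)], [(0, 20), (95, 100), (55, 58)]]


def matches(school, ranges):
    """True if every factor falls inside at least one of its ranges."""
    return all(
        any(lo <= school[name] <= hi for lo, hi in factor_ranges)
        for name, factor_ranges in zip(FACTORS, ranges)
    )


def classify(school):
    """Label a school using the high-risk ranges first, then the low-risk ones."""
    if matches(school, HIGH):
        return "High Risk"
    if matches(school, LOW):
        return "Low Risk"
    return "Unclassified"


# Factor values from Table 6 and Table 8.
table_6 = {"Enrollment": 314, "Attendance_Avg_Daily": 93.0,
           "Five_Essential_Leadership_Index": 82.0, "Misconduct": 1.038,
           "Performance_ELA": 23.951, "Performance_Math": 57.259}
table_8 = {"Enrollment": 685, "Attendance_Avg_Daily": 95.9,
           "Five_Essential_Leadership_Index": 15.0, "Misconduct": 0.476,
           "Performance_ELA": 59.388, "Performance_Math": 53.804}

print(classify(table_6))  # Low Risk
print(classify(table_8))  # High Risk
```

Because the high-risk and low-risk enrollment ranges do not overlap, a school cannot satisfy both sets of ranges, so checking the high-risk ranges first does not change the outcome.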